Abstract

We address the issue of context tree estimation in variable length hidden Markov models. We propose an estimator 
of the context tree of the hidden Markov process which needs no prior upper bound on the depth of the context tree. 
We prove that the estimator is strongly consistent. This uses information-theoretic mixture inequalities in the spirit
of [1], [2]. We propose an algorithm to efficiently compute the estimator and provide simulation studies to support
our result. 

Index Terms

I. Introduction

A variable length hidden Markov model (VLHMM) is a bivariate stochastic process $(X_n, Y_n)_{n \geq 0}$ where $(X_n)_{n \geq 0}$ (the state sequence) is a variable length Markov chain (VLMC) in a state space $\mathbb{X}$ and, conditionally on $(X_n)_{n \geq 0}$, $(Y_n)_{n \geq 0}$ is a sequence of independent variables in a state space $\mathbb{Y}$ such that the conditional distribution of $Y_n$ given the state sequence (called the emission distribution) depends on $X_n$ only. Such processes fall into the general framework of latent variable processes, and reduce to hidden Markov models (HMM) when the state sequence is a Markov chain. Latent variable processes are used as a flexible tool to model dependent non-Markovian time series, and the statistical problem is to estimate the parameters of the distribution when only $(Y_n)_{n \geq 0}$ is observed. In this paper we consider the case where the hidden process may take only a fixed and known number of values, that is, the case where the state space $\mathbb{X}$ is finite with known cardinality $k$.

The dependence structure of a latent variable process is driven by that of the hidden process $(X_n)_{n \geq 0}$, which is assumed here to be a variable length Markov chain (VLMC). Such processes were first introduced by Rissanen in [3] as a flexible and parsimonious modelling tool for data compression, approximating Markov chains of finite orders. Recall that a Markov process of order $d$ is such that the conditional distribution of $X_n$ given all past values depends only on the $d$ previous ones $X_{n-1}, \dots, X_{n-d}$. But different past values may lead to identical conditional distributions, so that not all $k^d$ possible past values are needed to describe the distribution of the process. A VLMC is such that the probability of the present state depends only on a finite part of the past, and the length of this relevant portion, called the context, is a function of the past itself. No context may be a proper postfix of any other context, so that the set of all contexts may be represented as a rooted labelled tree. This set is called the context tree of the VLMC.

Variable length hidden Markov models appear for the first time, to our knowledge, in movement analysis [4], [5]. Human movement analysis is the interpretation of movements as sequences of poses. [5] analyses the movement through 3D rotations of 19 major joints of the human body. Wang et al. then use a VLHMM representation where $X_n$ is the pose at time $n$ and $Y_n$ is the body position given by the 3D rotations of the 19 major joints. They argue that "VLHMM is superior in its efficiency and accuracy of modeling multivariate time-series data with highly-varied dynamics".

VLHMM could also be used in WIFI based indoor positioning systems (see [6]). Here $X_n$ is the position of a mobile device at time $n$ and $Y_n$ is the received signal strength (RSS) vector at time $n$. Each component of the RSS vector represents the strength of a signal sent by a WIFI access point. In practice, the aim is to estimate the positions of the device $(X_n)_{n \geq 0}$ on the basis of the observations $(Y_n)_{n \geq 0}$. The distribution of $Y_n$ given $X_n = x$ is calibrated beforehand for a finite number of locations $(L_1, \dots, L_k)$. A Markov chain on the finite set $(L_1, \dots, L_k)$ is then used to model the sequence of positions $(X_n)_{n \geq 0}$. Again, a VLHMM model would lead to efficient and accurate estimation of the device position.

The aim of this paper is to provide a statistical analysis of variable length hidden Markov models and, in particular, to propose a consistent estimator of the context tree of the hidden VLMC on the basis of the observations $(Y_n)_{n \geq 0}$ only. We consider a parametrized family of VLHMM, and we use a penalized likelihood method to estimate the context tree of the hidden VLMC. For each possible context tree $\tau$, if $\Theta_\tau$ is the set of possible parameters, we define

$$\hat\tau_n = \operatorname*{argmin}_{\tau} \Big\{ -\sup_{\theta \in \Theta_\tau} \log g_\theta(Y_{1:n}) + \mathrm{pen}(n, \tau) \Big\}$$

where $g_\theta(y_{1:n})$ is the density of the distribution of the observation $Y_{1:n} = (Y_1, \dots, Y_n)$ under the parameter $\theta$ with respect to some dominating positive measure, and $\mathrm{pen}(n, \tau)$ is a penalty that depends on the number $n$ of observations and the context tree $\tau$. Our aim is to find penalties for which the estimator is strongly consistent without any prior upper bound on the depth of the context tree, and to provide a practical algorithm to compute the estimator.

Context tree estimation for a VLHMM is similar to order estimation for a HMM in which the order is defined as the unknown cardinality of the state space $\mathbb{X}$. The main difficulty lies in the calibration of the penalty, which requires some understanding of the growth of the likelihood ratios (with respect to orders and to the number of observations). In particular cases, the fluctuations of the likelihood ratios may be understood via empirical process theory, see the recent works [7] for finite state Markov chains and [8] for independent identically distributed observations. Latent variable models are much more complicated, see for instance [9] where it is proved in the HMM situation that the likelihood ratio statistic converges to infinity for overestimated order. We thus use an approach based on information theory tools to understand the behavior of likelihood ratios. Such tools have been successful for HMM order estimation problems and were used in [2], [11] for discrete observations and in [10] for Poisson emission distributions or Gaussian emission distributions with known variance. Our main result shows that for a penalty of the form $C(\tau) \log n$, $\hat\tau_n$ is strongly consistent, that is, it converges almost surely to the true unknown context tree. Here, $C(\tau)$ has an explicit formulation but is slightly bigger than $(k-1)|\tau|/2$, which gives the popular BIC penalty. We study the important situation of Gaussian emissions with unknown variance, and prove that our consistency theorem holds in this case.

Computation of the estimator requires computation of the maximum likelihood for all possible context trees. As usual, the EM algorithm may be used to compute the maximum likelihood estimator for the parameters when the context tree is fixed. We then propose an algorithm to compute the estimator, which avoids exploring too large a number of context trees. In general the EM algorithm needs to be run several times with different initial values to avoid being trapped in local extrema. In the important situation of Gaussian emissions, we propose a way to choose the initial parameters so that only one run of the EM algorithm is needed. Simulations compare penalized maximum likelihood estimators of the context tree $\tau$ of the hidden VLMC using our penalty and using the BIC penalty.

The structure of this paper is the following. Section II describes the model and gives the notation. Section III presents the information theory tools we use, states the main consistency result and applies it to Poisson emission distributions and Gaussian emission distributions with known variance. Section IV proves the result for Gaussian emission distributions with unknown variance. In Section V we describe the algorithm to compute the estimator and we give the simulation results. The proofs that are not essential at first reading are detailed in the Appendix.



II. Basic setting and notation 

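Let $\mathbb{X}$ be a finite set whose cardinality is denoted by $|\mathbb{X}| = k$, that we identify with $\{1, \dots, k\}$. Let $\mathcal{F}_{\mathbb{X}}$ be the finite collection of subsets of $\mathbb{X}$. Let $\mathbb{Y}$ be a Polish space endowed with its Borel sigma-field $\mathcal{F}_{\mathbb{Y}}$. We will work on the measurable space $(\Omega, \mathcal{F})$ with $\Omega = (\mathbb{X} \times \mathbb{Y})^{\mathbb{N}}$ and $\mathcal{F} = (\mathcal{F}_{\mathbb{X}} \otimes \mathcal{F}_{\mathbb{Y}})^{\otimes \mathbb{N}}$.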



A. Context trees and variable length Markov chains 

A string $s = x_k x_{k+1} \cdots x_l \in \mathbb{X}^{l-k+1}$ is denoted by $x_{k:l}$ and its length is then $l(s) = l - k + 1$. We call letters of $s$ its components $x_i$, $i = k, \dots, l$. The concatenation of the strings $u$ and $v$ is denoted by $uv$. A string $v$ is a postfix of a string $s$ if there exists a string $u$ such that $s = uv$.

A set $\tau$ of strings and possibly semi-infinite sequences is called a tree if the following tree property holds: no $s \in \tau$ is a postfix of any other $s' \in \tau$. A tree $\tau$ is irreducible if no element $s \in \tau$ can be replaced by a postfix without violating the tree property. It is complete if each node except the leaves has exactly $|\mathbb{X}|$ children. We denote by $d(\tau)$ the depth of $\tau$: $d(\tau) = \max\{l(s) \mid s \in \tau\}$.
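The definitions above translate directly into a few checks. The following Python sketch is only an illustration under stated assumptions: a finite string is represented as a tuple of letters in chronological order, a finite tree as a set of such tuples, and all function names are hypothetical (they do not come from the paper). Completeness is tested through an equivalent characterization for finite trees: every past of length $d(\tau)$ has exactly one postfix in the tree.

```python
from itertools import product


def is_postfix(v, s):
    """True if v is a postfix of s, i.e. s = uv for some string u."""
    return len(v) <= len(s) and tuple(s[len(s) - len(v):]) == tuple(v)


def has_tree_property(tree):
    """Tree property: no element is a postfix of another element."""
    return not any(s != t and is_postfix(s, t) for s in tree for t in tree)


def depth(tree):
    """d(tau): the maximal length of a string in the tree."""
    return max(len(s) for s in tree)


def is_complete(tree, alphabet):
    """Finite tree in which every past of length d(tau) has exactly one postfix."""
    d = depth(tree)
    return has_tree_property(tree) and all(
        sum(is_postfix(s, past) for s in tree) == 1
        for past in product(alphabet, repeat=d)
    )
```

For instance, with the binary alphabet `(0, 1)`, the set `{(0,), (0, 1), (1, 1)}` reads as: the context is `0` when the last state is 0, and otherwise it is given by the last two states.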
Let now $Q$ be the distribution of an ergodic stationary process $(X_n)_{n \in \mathbb{Z}}$ on $(\mathbb{X}^{\mathbb{Z}}, \mathcal{F}_{\mathbb{X}}^{\otimes \mathbb{Z}})$, and for any $m < n$ and any $x_{m:n}$ in $\mathbb{X}^{n-m+1}$, write $Q(x_{m:n})$ for $Q(X_{0:n-m} = x_{m:n})$.

Definition 1. Let $\tau$ be a tree. $\tau$ is called a $Q$-adapted context tree if for any string $s$ in $\tau$ such that $Q(s) > 0$:

$$\forall x_0 \in \mathbb{X}, \quad Q(X_0 = x_0 \mid X_{-\infty:-1} = x_{-\infty:-1}) = Q(X_0 = x_0 \mid X_{-l(s):-1} = s) \tag{1}$$

whenever $s$ is a postfix of the semi-infinite sequence $x_{-\infty:-1}$. Moreover, if for any $s \in \tau$, $Q(s) > 0$ and no proper postfix of $s$ has the property (1), then $\tau$ is called the minimal context tree of the distribution $Q$, and $(X_n)_{n \in \mathbb{Z}}$ is called a variable length Markov chain (VLMC).

If a tree $\tau$ is $Q$-adapted, then for all sequences $x_{-\infty:-1}$ such that for any $M \geq 1$, $Q(x_{-M:-1}) > 0$, there exists a unique string in $\tau$ which is a postfix of $x_{-\infty:-1}$. We denote this postfix by $\tau(x_{-\infty:-1})$.

A tree $\tau$ is said to be a subtree of $\tau'$ if for each string $s'$ in $\tau'$ there exists a string $s$ in $\tau$ which is a postfix of $s'$. Then if $\tau$ is a $Q$-adapted tree, any tree $\tau'$ such that $\tau$ is a subtree of $\tau'$ will be $Q$-adapted.
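As an illustration of the context function and of the subtree relation, here is a minimal Python sketch. It assumes finite trees stored as sets of tuples (letters in chronological order) and a past that is long enough to contain a context; the helper names are hypothetical.

```python
def is_postfix(v, s):
    """True if v is a postfix of s, i.e. s = uv for some string u."""
    return len(v) <= len(s) and tuple(s[len(s) - len(v):]) == tuple(v)


def context(tree, past):
    """Return tau(past): the unique element of the tree that is a postfix of past."""
    matches = [s for s in tree if is_postfix(s, past)]
    if len(matches) != 1:
        raise ValueError("the past has no unique context in this tree")
    return matches[0]


def is_subtree(tree, bigger):
    """tree is a subtree of bigger if each string of bigger has a postfix in tree."""
    return all(any(is_postfix(s, s2) for s in tree) for s2 in bigger)
```

With `tau = {(0,), (0, 1), (1, 1)}`, the past `(1, 0, 1)` has context `(0, 1)`, and `tau` is a subtree of the full tree of depth 2, while the converse does not hold.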

Definition 2. Let $Q$ be the distribution of a VLMC $(X_n)_{n \in \mathbb{Z}}$. Let $\tau_Q$ be its minimal context tree. There exists a unique complete tree $\tau^*$ such that $\tau_Q$ is a subtree of $\tau^*$ and

$$|\tau^*| = \min\{|\tau| : \tau \text{ is a complete tree and } \tau_Q \text{ is a subtree of } \tau\}.$$

$\tau^*$ is called the minimal complete context tree of the distribution $Q$ of the VLMC $(X_n)_{n \in \mathbb{Z}}$.

Let us define, for any complete tree $\tau$, the set of transition parameters:

$$\Theta_{t,\tau} = \Big\{ (P_{s,i})_{s \in \tau, i \in \mathbb{X}} : \forall s \in \tau, \forall i \in \mathbb{X}, \ P_{s,i} \geq 0 \text{ and } \sum_{i=1}^{k} P_{s,i} = 1 \Big\}.$$

If $(X_n)_{n \in \mathbb{Z}}$ is a VLMC with minimal complete context tree $\tau^*$ and transition parameters $\theta_t^* = (P^*_{s,i})_{s \in \tau^*, i \in \mathbb{X}} \in \Theta_{t,\tau^*}$, then for any complete tree $\tau$ such that $\tau^*$ is a subtree of $\tau$, there exists a unique $\theta_t = (P_{s,i})_{s \in \tau, i \in \mathbb{X}} \in \Theta_{t,\tau}$ that defines the same VLMC transition probabilities, namely: for any $s \in \tau$, there exists a unique $u \in \tau^*$ which is a postfix of $s$, and for all $i \in \mathbb{X}$, $P_{s,i} = P^*_{u,i}$. Of course, a parameter in $\Theta_{t,\tau}$ might not be sufficient to define a unique distribution of a VLMC (if there is no unique stationary distribution). But the parameter defines a unique distribution of VLMC if, for instance, the Markov chain $((X_{n-d(\tau)+1}, \dots, X_n))_{n \in \mathbb{Z}}$ it defines is irreducible.

B. Variable length hidden Markov models 

A variable length hidden Markov model (VLHMM) is a bivariate stochastic process $(X_n, Y_n)_{n \geq 0}$ where $(X_n)_{n \geq 0}$ (the state sequence) is a (non observed) stochastic process which is the restriction to non negative indices of a VLMC $(X_n)_{n \in \mathbb{Z}}$ with values in $\mathbb{X}$ and, conditionally on $(X_n)_{n \geq 0}$, $(Y_n)_{n \geq 0}$ is a sequence of independent variables in the state space $\mathbb{Y}$ such that for any integer $n$, the conditional distribution of $Y_n$ given the state sequence (called the emission distribution) depends on $X_n$ only.

We assume that the emission distributions are absolutely continuous with respect to some positive measure $\mu$ on $(\mathbb{Y}, \mathcal{F}_{\mathbb{Y}})$ and are parametrized by a set of parameters $\Theta_e \subset (\mathbb{R}^{d_e})^k \times \mathbb{R}^{m_e}$, so that the set of emission densities (the possible densities of the distribution of $Y_n$ conditional on $X_n = x$) is $\{ (g_{\theta_{e,x},\eta}(\cdot))_{x \in \mathbb{X}}, \ \theta_e = (\theta_{e,1}, \dots, \theta_{e,k}, \eta) \in \Theta_e \}$. For any complete tree $\tau$, we now define the parameter set:

$$\Theta_\tau = \Theta_{t,\tau} \times \Theta_e,$$

and define, for $\theta = (\theta_t, \theta_e) \in \Theta_\tau$, $P_\theta$ the probability distribution of the VLHMM $(X_n, Y_n)_{n \geq 0}$ such that $(X_n)_{n \in \mathbb{Z}}$ is the VLMC with complete context tree $\tau$ and transition parameter $\theta_t$, and such that, for any $(u_1, u_2) \in \mathbb{N}^2$, $u_1 \leq u_2$, any sets $A_{u_1}, \dots, A_{u_2}$ in $\mathcal{F}_{\mathbb{Y}}$, any $x_{u_1:u_2} \in \mathbb{X}^{u_2 - u_1 + 1}$,

$$P_\theta(Y_{u_1:u_2} \in A_{u_1} \times \cdots \times A_{u_2} \mid X_{u_1:u_2} = x_{u_1:u_2}) = \prod_{u=u_1}^{u_2} \int_{A_u} g_{\theta_{e,x_u},\eta}(y) \, d\mu(y).$$



Of course, as noted before, it can happen that $\theta_t$ does not define a unique VLHMM. We shall however not consider this question, since we shall assume that the true parameter defines an irreducible hidden VLMC, and we shall introduce initial distributions to define a computable likelihood: throughout the paper we shall assume that the observations $(Y_1, \dots, Y_n) = Y_{1:n}$ come from a VLHMM with parameter $\theta^*$ such that $\tau^*$ is the minimal complete context tree of the hidden VLMC, and such that $((X_{n-d(\tau^*)+1}, \dots, X_n))_{n \in \mathbb{Z}}$ is a stationary and irreducible Markov chain. To define a computable likelihood, we introduce, for any positive integer $d$, a probability distribution $\nu_d$ on $\mathbb{X}^d$ so that, for any complete tree $\tau$ and any $\theta = (\theta_t, \theta_e) \in \Theta_\tau$, we set what will be called the likelihood:

$$\forall y_{1:n} \in \mathbb{Y}^n, \quad g_\theta(y_{1:n}) = \sum_{x_{1:n} \in \mathbb{X}^n} \Big[ \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i) \Big] g_{\theta_t}(x_{1:n}) \tag{2}$$

where, if $\theta_t = (P_{s,x})_{s \in \tau, x \in \mathbb{X}}$:

$$g_{\theta_t}(x_{1:n}) = \sum_{x_{-d(\tau)+1:0} \in \mathbb{X}^{d(\tau)}} \nu_{d(\tau)}(x_{-d(\tau)+1:0}) \prod_{i=1}^{n} P_{\tau(x_{i-d(\tau):i-1}), x_i} \tag{3}$$
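To make the likelihood (2)-(3) concrete, the following Python sketch evaluates it by brute-force enumeration of all hidden paths. It is an illustration only and rests on assumptions that are not in the text: states are labelled `0, ..., k-1`, the initial distribution $\nu_d$ is taken uniform, emissions are Gaussian with known standard deviation `sigma`, and the function names are hypothetical. The cost is exponential in $n$, so this is useful only to check a faster implementation on tiny examples.

```python
import math
from itertools import product


def is_postfix(v, s):
    """True if v is a postfix of s."""
    return len(v) <= len(s) and tuple(s[len(s) - len(v):]) == tuple(v)


def gaussian_density(y, mean, sigma):
    """Density of N(mean, sigma^2) at y (assumed emission family)."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def likelihood(y, tree, P, means, k, sigma=1.0):
    """g_theta(y_{1:n}) from (2)-(3), with nu_d uniform on X^d (assumption)."""
    n = len(y)
    d = max(len(s) for s in tree)
    total = 0.0
    for init in product(range(k), repeat=d):      # x_{-d+1:0}
        for x in product(range(k), repeat=n):     # x_{1:n}
            path = init + x
            p = k ** (-d)                         # uniform nu_d
            for i in range(n):
                past = path[i:i + d]              # the d states preceding path[d + i]
                s = next(c for c in tree if is_postfix(c, past))
                state = path[d + i]
                p *= P[s][state] * gaussian_density(y[i], means[state], sigma)
            total += p
    return total
```

A simple sanity check: when all states share the same emission mean, the hidden path is irrelevant and the likelihood reduces to the product of the emission densities.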



We are concerned with the statistical estimation of the tree $\tau^*$ using a method that involves no prior upper bound on the depth of $\tau^*$. Define the following estimator of the minimal complete context tree $\tau^*$:

$$\hat\tau_n = \operatorname*{argmin}_{\tau \text{ complete tree}} \Big[ -\sup_{\theta \in \Theta_\tau} \log g_\theta(Y_{1:n}) + \mathrm{pen}(n, \tau) \Big] \tag{4}$$

where $\mathrm{pen}(n, \tau)$ is a penalty term depending on the number of observations $n$ and the complete tree $\tau$.
The label switching phenomenon occurs in statistical inference of VLHMM as it occurs in statistical inference of HMM and of population mixtures. That is: applying a label permutation on $\mathbb{X}$ does not change the distribution of $(Y_n)_{n \geq 0}$. Thus, if $\sigma$ is a permutation of $\{1, \dots, k\}$ and $\tau$ is a complete tree, we define the complete tree $\sigma(\tau)$ by

$$\sigma(\tau) = \{ \sigma(x_1) \cdots \sigma(x_l) \mid x_{1:l} \in \tau \}.$$

Definition 3. If $\tau$ and $\tau'$ are two complete trees, we say that $\tau$ and $\tau'$ are equivalent, and denote it by $\tau \sim \tau'$, if there exists a permutation $\sigma$ of $\mathbb{X}$ such that $\sigma(\tau) = \tau'$.

We then choose $\mathrm{pen}(n, \tau)$ to be invariant by permutation, that is: for any permutation $\sigma$ of $\mathbb{X}$, $\mathrm{pen}(n, \sigma(\tau)) = \mathrm{pen}(n, \tau)$. In this case, for any complete tree $\tau$,

$$-\sup_{\theta \in \Theta_\tau} \log g_\theta(Y_{1:n}) + \mathrm{pen}(n, \tau) = -\sup_{\theta \in \Theta_{\sigma(\tau)}} \log g_\theta(Y_{1:n}) + \mathrm{pen}(n, \sigma(\tau))$$

so that the definition of $\hat\tau_n$ requires a choice in the set of minimizers of (4).

Our aim is now to find penalties allowing us to prove the strong consistency of $\hat\tau_n$, that is, such that $\hat\tau_n \sim \tau^*$, $P_{\theta^*}$-eventually almost surely as $n \to \infty$.

III. The general strong consistency theorem 

In this section, we first recall the tools borrowed from information theory, and state the result that we use in order to find a penalty ensuring the strong consistency of $\hat\tau_n$. Then we give our general strong consistency theorem, and straightforward applications. The application to Gaussian emissions with unknown variance, which is more involved, is deferred to the next section.

A. An information theoretic inequality 

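We shall introduce mixture probability distributions on $\mathbb{Y}^n$ and compare them to the maximum likelihood, in the same way as [11] first did; see also [12] and [13] for tutorials and use of such ideas in statistical methods. For any complete tree $\tau$, we define, for all positive integers $n$, the mixture measure $KT_\tau^n$ on $\mathbb{Y}^n$ using a prior $\pi^n$ on $\Theta_\tau$:

$$\pi^n(d\theta) = \pi_t(d\theta_t) \otimes \pi_e^n(d\theta_e)$$

where $\pi_e^n$ is a prior on $\Theta_e$ that may change with $n$, and $\pi_t$ is the prior on $\Theta_{t,\tau}$ such that, if $\theta_t = (P_{s,i})_{s \in \tau, i \in \mathbb{X}}$,

$$\pi_t(d\theta_t) = \bigotimes_{s \in \tau} \pi_s\big(d(P_{s,i})_{i \in \mathbb{X}}\big)$$

where $(\pi_s)_{s \in \tau}$ are Dirichlet $\mathcal{D}(\frac{1}{2}, \dots, \frac{1}{2})$ distributions on $[0,1]^{|\mathbb{X}|}$. Then $KT_\tau^n$ is defined on $\mathbb{Y}^n$ by

$$KT_\tau^n(y_{1:n}) = \sum_{x_{1:n} \in \mathbb{X}^n} KT_{\tau,t}(x_{1:n}) \, KT_e^n(y_{1:n} \mid x_{1:n})$$

where

$$KT_e^n(y_{1:n} \mid x_{1:n}) = \int_{\Theta_e} \Big[ \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i) \Big] \pi_e^n(d\theta_e)$$

and

$$KT_{\tau,t}(x_{1:n}) = \Big(\frac{1}{k}\Big)^{d(\tau)} \int_{\Theta_{t,\tau}} P_{\theta_t}(x_{d(\tau)+1:n} \mid x_{1:d(\tau)}) \, \pi_t(d\theta_t) = \Big(\frac{1}{k}\Big)^{d(\tau)} \prod_{s \in \tau} \int_{[0,1]^{|\mathbb{X}|}} \prod_{x \in \mathbb{X}} P_{s,x}^{a_s^x(x_{1:n})} \, \pi_s\big(d(P_{s,x})_{x \in \mathbb{X}}\big)$$

where $a_s^x(x_{1:n})$ is the number of times that $x$ appears in context $s$, that is $a_s^x(x_{1:n}) = \sum_{i=d(\tau)+1}^{n} 1_{x_i = x, \, x_{i-l(s):i-1} = s}$.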
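The transition part of the mixture has a closed form, since the integral of $\prod_x P_{s,x}^{a_s^x}$ against a Dirichlet $\mathcal{D}(\frac12, \dots, \frac12)$ prior equals $\frac{\Gamma(k/2)}{\Gamma(1/2)^k} \frac{\prod_x \Gamma(a_s^x + 1/2)}{\Gamma(\sum_x a_s^x + k/2)}$. The Python sketch below uses this standard identity to compute the counts $a_s^x(x_{1:n})$ and $\log KT_{\tau,t}(x_{1:n})$. It is an illustration under assumptions: states are labelled `0, ..., k-1`, the tree is a complete finite tree stored as a set of tuples, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def is_postfix(v, s):
    """True if v is a postfix of s."""
    return len(v) <= len(s) and tuple(s[len(s) - len(v):]) == tuple(v)


def context_counts(x, tree):
    """a_s^x(x_{1:n}): number of times each letter follows context s, for i > d(tau)."""
    d = max(len(s) for s in tree)
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(d, len(x)):
        past = tuple(x[i - d:i])
        s = next(c for c in tree if is_postfix(c, past))
        counts[s][x[i]] += 1
    return counts


def log_kt_transition(x, tree, k):
    """log KT_{tau,t}(x_{1:n}) with Dirichlet(1/2, ..., 1/2) priors on each context."""
    d = max(len(s) for s in tree)
    counts = context_counts(x, tree)
    log_p = -d * math.log(k)                      # factor (1/k)^{d(tau)}
    for s in tree:
        a = [counts[s][letter] for letter in range(k)]
        log_p += math.lgamma(k / 2) - k * math.lgamma(0.5)
        log_p += sum(math.lgamma(a_x + 0.5) for a_x in a)
        log_p -= math.lgamma(sum(a) + k / 2)
    return log_p
```

Since $KT_{\tau,t}$ is a probability distribution on $\mathbb{X}^n$, summing `exp(log_kt_transition(x, tree, k))` over all sequences `x` of a given length should give 1, which is a convenient check on small examples.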
The following inequality will be a key tool to control the fluctuations of the likelihood.

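Proposition 1. There exists a finite constant $D$ depending only on $k$ such that for any complete tree $\tau$, and any $y_{1:n} \in \mathbb{Y}^n$:

$$0 \leq \sup_{\theta \in \Theta_\tau} \log g_\theta(y_{1:n}) - \log KT_\tau^n(y_{1:n}) \leq \sup_{\theta_e \in \Theta_e} \max_{x_{1:n}} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i) - \log KT_e^n(y_{1:n} \mid x_{1:n}) \Big] + \frac{k-1}{2} |\tau| \log n + D.$$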

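Proof: Let $\tau$ be a complete tree. For any $\theta \in \Theta_\tau$,

$$\frac{g_\theta(y_{1:n})}{KT_\tau^n(y_{1:n})} = \frac{\sum_{x_{1:n}} g_{\theta_t}(x_{1:n}) \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i)}{\sum_{x_{1:n}} KT_{\tau,t}(x_{1:n}) \, KT_e^n(y_{1:n} \mid x_{1:n})} \leq \max_{x_{1:n}} \frac{g_{\theta_t}(x_{1:n}) \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i)}{KT_{\tau,t}(x_{1:n}) \, KT_e^n(y_{1:n} \mid x_{1:n})}.$$

Thus,

$$\log \frac{g_\theta(y_{1:n})}{KT_\tau^n(y_{1:n})} \leq \max_{x_{1:n}} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i) - \log KT_e^n(y_{1:n} \mid x_{1:n}) \Big] + |\tau| \, \gamma\Big(\frac{n}{|\tau|}\Big) + d(\tau) \log k$$

where $\gamma(x) = \frac{k-1}{2} \log x + \log k$, using [13]. Then

$$\log \frac{g_\theta(y_{1:n})}{KT_\tau^n(y_{1:n})} \leq \max_{x_{1:n}} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i) - \log KT_e^n(y_{1:n} \mid x_{1:n}) \Big] + \frac{k-1}{2} |\tau| \log n + D(\tau)$$

where $D(\tau) = -\frac{k-1}{2} |\tau| \log |\tau| + |\tau| \log k + d(\tau) \log k$. Now, since $\tau$ is complete, it has $(|\tau| - 1)/(k-1)$ internal nodes, hence $d(\tau) \leq \frac{|\tau| - 1}{k - 1}$, so that

$$D(\tau) \leq |\tau| \Big( \log k - \frac{k-1}{2} \log |\tau| \Big) + \frac{|\tau| - 1}{k - 1} \log k.$$

But the upper bound in this inequality tends to $-\infty$ when $|\tau|$ tends to $\infty$, so that there exists a constant $D$ depending only on $k$ such that for any complete tree $\tau$, $D(\tau) \leq D$. ■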



B. Strong consistency theorem 

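Let $\theta^* = (\theta_t^*, \theta_e^*)$ with $\theta_t^* = (P^*_{s,i})_{s \in \tau^*, i \in \mathbb{X}}$ and $\theta_e^* = (\theta^*_{e,1}, \dots, \theta^*_{e,k}, \eta^*)$ be the true parameters of the VLHMM.

Let us now define, for any positive $\alpha$, the penalty:

$$\mathrm{pen}_\alpha(n, \tau) = \sum_{t=1}^{|\tau|} \frac{(k-1)t + \alpha}{2} \log n \tag{5}$$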



Notice that the complexity of the model is taken into account through the cardinality of the tree $\tau$.
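For a quick numerical feel of the penalty $\mathrm{pen}_\alpha(n, \tau) = \sum_{t=1}^{|\tau|} \frac{(k-1)t + \alpha}{2} \log n$ compared with the BIC penalty $\frac{k-1}{2} |\tau| \log n$, here is a small Python sketch. The function and argument names are hypothetical; `tree_size` stands for $|\tau|$.

```python
import math


def pen_alpha(n, tree_size, k, alpha):
    """Penalty (5): sum over t = 1..|tau| of ((k-1) t + alpha) / 2, times log n."""
    return sum(((k - 1) * t + alpha) / 2 for t in range(1, tree_size + 1)) * math.log(n)


def pen_bic(n, tree_size, k):
    """BIC penalty: (k-1)/2 * |tau| * log n."""
    return (k - 1) / 2 * tree_size * math.log(n)
```

Because each term of the sum is at least $(k-1)/2$, `pen_alpha` is never smaller than `pen_bic` for the same tree, and it grows quadratically in $|\tau|$ rather than linearly.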
We need to introduce further assumptions. 

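• (A1). The Markov chain $((X_{n-d(\tau^*)+1}, \dots, X_n))_{n \geq d(\tau^*)}$ is irreducible.

• (A2). For any complete tree $\tau$ such that $|\tau| \leq |\tau^*|$ and which is not equivalent to $\tau^*$, for any $\theta \in \Theta_\tau$, the random sequence $(\theta_{e,X_n})_{n \in \mathbb{Z}}$, where $(X_n)_{n \in \mathbb{Z}}$ is a VLMC with transition probabilities $\theta_t$, has a different distribution than $(\theta^*_{e,X_n})_{n \in \mathbb{Z}}$, where $(X_n)_{n \in \mathbb{Z}}$ is a VLMC with transition probabilities $\theta_t^*$.

• (A3). The family $\{g_{\theta_e}, \theta_e \in \Theta_e\}$ is such that for any probability distributions $(a_i)_{i=1,\dots,k}$ and $(a'_i)_{i=1,\dots,k}$ on $\{1, \dots, k\}$, any $(\theta_1, \dots, \theta_k, \eta) \in \Theta_e$ and $(\theta'_1, \dots, \theta'_k, \eta') \in \Theta_e$, if

$$\sum_{i=1}^{k} a_i \, g_{\theta_i,\eta} = \sum_{i=1}^{k} a'_i \, g_{\theta'_i,\eta'}$$

then

$$\sum_{i=1}^{k} a_i \, \delta_{\theta_i} = \sum_{i=1}^{k} a'_i \, \delta_{\theta'_i} \quad \text{and} \quad \eta = \eta'.$$

• (A4). For any $y \in \mathbb{Y}$, $\theta_e \mapsto g_{\theta_e}(y) = (g_{\theta_{e,i},\eta}(y))_{i \in \mathbb{X}}$ is continuous and tends to zero when $\|\theta_e\|$ tends to infinity.

• (A5). For any $i \in \mathbb{X}$, $E_{\theta^*}\big[ |\log g_{\theta^*_{e,i},\eta^*}(Y_1)| \big] < \infty$.

• (A6). For any $\theta_e \in \Theta_e$, there exists $\delta > 0$ such that $E_{\theta^*}\Big[ \sup_{\|\theta'_e - \theta_e\| < \delta} \big( \log g_{\theta'_e}(Y_1) \big)^+ \Big] < \infty$.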

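Theorem 1. Assume that (A1) to (A6) hold, and that moreover there exists a positive real number $b$ such that

$$\sup_{\theta_e \in \Theta_e} \sup_{x_{1:n}} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(Y_i) - \log KT_e^n(Y_{1:n} \mid x_{1:n}) \Big] \leq b \log n \tag{6}$$

$P_{\theta^*}$-eventually almost surely. If one chooses $\alpha > 2(b+1)$ in the penalty (5), then $\hat\tau_n \sim \tau^*$, $P_{\theta^*}$-eventually almost surely.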



Notice that, to apply this theorem, one has to find a sequence of priors $\pi_e^n$ on $\Theta_e$ such that (6) holds. The remainder of the section will prove that this is possible in situations where priors may be defined as in previous works about HMM order estimation, while in the next section we will prove that it is possible to find a prior in the important case of Gaussian emissions with unknown variance.

In the following proof, the assumption (6) ensures that $|\hat\tau_n| \leq |\tau^*|$ eventually almost surely, while assumptions (A1-6) ensure that for any complete tree $\tau$ such that $|\tau| < |\tau^*|$, or $|\tau| = |\tau^*|$ and $\tau \nsim \tau^*$, $\hat\tau_n \neq \tau$ $P_{\theta^*}$-eventually almost surely. In particular (A2) holds whenever $\theta^*_{e,x} \neq \theta^*_{e,y}$ if $(x, y) \in \mathbb{X}^2$ and $x \neq y$.



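Proof: The proof is structured as follows: we first prove that, $P_{\theta^*}$-eventually almost surely, $|\hat\tau_n| \leq |\tau^*|$. We then prove that for any complete tree $\tau$ such that $|\tau| \leq |\tau^*|$ and $\tau \nsim \tau^*$, $\hat\tau_n \neq \tau$ $P_{\theta^*}$-eventually almost surely. This will end the proof since there is a finite number of such trees. For any $n \in \mathbb{N}$, we denote by $E_n$ the event

$$E_n = \Big\{ \sup_{\theta_e \in \Theta_e} \sup_{x_{1:n}} \Big( \log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(Y_i) - \log KT_e^n(Y_{1:n} \mid x_{1:n}) \Big) \leq b \log n \Big\}.$$

By using (6) and the Borel-Cantelli Lemma, to get that $P_{\theta^*}$-eventually almost surely $|\hat\tau_n| \leq |\tau^*|$, it is enough to show that

$$\sum_{n=1}^{\infty} P_{\theta^*}\{ (|\hat\tau_n| > |\tau^*|) \cap E_n \} < \infty.$$

Let $\tau$ be a complete tree such that $|\tau| > |\tau^*|$. Using Proposition 1,

$$P_{\theta^*}\{ (\hat\tau_n = \tau) \cap E_n \} \leq P_{\theta^*}\Big\{ \Big( \sup_{\theta \in \Theta_\tau} \log g_\theta(Y_{1:n}) - \mathrm{pen}_\alpha(n, \tau) \geq \log g_{\theta^*}(Y_{1:n}) - \mathrm{pen}_\alpha(n, \tau^*) \Big) \cap E_n \Big\}$$

$$\leq P_{\theta^*}\Big\{ \Big( \log KT_\tau^n(Y_{1:n}) + \sup_{\theta_e \in \Theta_e} \sup_{x_{1:n}} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(Y_i) - \log KT_e^n(Y_{1:n} \mid x_{1:n}) \Big] + \frac{k-1}{2} |\tau| \log n + D - \log g_{\theta^*}(Y_{1:n}) + \mathrm{pen}_\alpha(n, \tau^*) - \mathrm{pen}_\alpha(n, \tau) \geq 0 \Big) \cap E_n \Big\}$$

$$\leq P_{\theta^*}\big\{ g_{\theta^*}(Y_{1:n}) \leq KT_\tau^n(Y_{1:n}) \exp(e_{\tau,n}) \big\}$$

with

$$e_{\tau,n} = \frac{k-1}{2} |\tau| \log n + b \log n + D + \mathrm{pen}_\alpha(n, \tau^*) - \mathrm{pen}_\alpha(n, \tau).$$

But

$$e_{\tau,n} = \frac{k-1}{2} |\tau| \log n + b \log n + D + \sum_{t=1}^{|\tau^*|} \frac{(k-1)t + \alpha}{2} \log n - \sum_{t=1}^{|\tau|} \frac{(k-1)t + \alpha}{2} \log n$$

$$= \frac{k-1}{2} |\tau| \log n + b \log n + D - \sum_{t=|\tau^*|+1}^{|\tau|} \frac{(k-1)t + \alpha}{2} \log n$$

$$\leq -\frac{\alpha}{2} (|\tau| - |\tau^*|) \log n + b \log n + D,$$

so that

$$P_{\theta^*}\{ (\hat\tau_n = \tau) \cap E_n \} \leq e^{-\frac{\alpha}{2}(|\tau| - |\tau^*|) \log n + b \log n + D} \leq C \, n^{-\frac{\alpha}{2}(|\tau| - |\tau^*|) + b}$$

for some constant $C$. Thus

$$P_{\theta^*}\{ (|\hat\tau_n| > |\tau^*|) \cap E_n \} \leq C \sum_{t=|\tau^*|+1}^{\infty} CT(t) \, n^{-\frac{\alpha}{2}(t - |\tau^*|) + b}$$

where $CT(t)$ is the number of complete trees with $t$ leaves. But using Lemma 2 in [14], $CT(t) \leq 16^t$, so that

$$P_{\theta^*}\{ (|\hat\tau_n| > |\tau^*|) \cap E_n \} \leq C \, n^b \, 16^{|\tau^*|} \sum_{t=1}^{\infty} \big[ 16 \, n^{-\alpha/2} \big]^t = O(n^{-\alpha/2 + b})$$

which is summable if $\alpha > 2(b+1)$.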



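Let now $\tau$ be a tree such that $|\tau| \leq |\tau^*|$ and $\tau \nsim \tau^*$. Let $\tau_M$ be a complete tree such that $\tau$ and $\tau^*$ are both subtrees of $\tau_M$. Then, by setting for any integer $n \geq d(\tau_M) - 1$, $W_n = [X_{n-d(\tau_M)+1:n}]$, for any $\theta \in \Theta_\tau \cup \Theta_{\tau^*}$, $(W_n, Y_n)_{n \in \mathbb{Z}}$ is a HMM under $P_\theta$. Following the proof of Theorem 3 of [15], we obtain that there exists $K > 0$ such that $P_{\theta^*}$-eventually a.s.,

$$\frac{1}{n} \log g_{\theta^*}(Y_{1:n}) - \sup_{\theta \in \Theta_\tau} \frac{1}{n} \log g_\theta(Y_{1:n}) \geq K$$

so that

$$\log g_{\theta^*}(Y_{1:n}) - \mathrm{pen}(n, \tau^*) - \sup_{\theta \in \Theta_\tau} \log g_\theta(Y_{1:n}) + \mathrm{pen}(n, \tau) > 0, \quad P_{\theta^*}\text{-eventually a.s.},$$

which finishes the proof of Theorem 1. ■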

C. Gaussian emissions with known variance 

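Here, we do not need the parameter $\eta$ so we omit it. Then $\Theta_e = \{ \theta_e = (m_1, \dots, m_k) \in \mathbb{R}^k \}$. The conditional likelihood is given, for any $\theta_e = (m_x)_{x \in \mathbb{X}}$, by

$$g_{\theta_{e,x}}(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{(y - m_x)^2}{2\sigma^2} \Big).$$

Proposition 2. Assume (A1-2). If one chooses $\alpha > k + 2$ in the penalty (5), then $\hat\tau_n \sim \tau^*$, $P_{\theta^*}$-eventually a.s.

Proof: The identifiability of the Gaussian model (A3) has been proved by Yakowitz and Spragins in [16], and it is easy to see that Assumptions (A4) to (A6) hold. Now, we define the prior measure $\pi_e^n$ on $\Theta_e$ as the probability distribution under which $\theta_e = (m_1, \dots, m_k)$ is a vector of $k$ independent random variables with centered Gaussian distribution with variance $\tau^2$. Then, using [10], $P_{\theta^*}$-eventually a.s.,

$$\sup_{\theta_e \in \Theta_e} \max_{x_{1:n}} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i}}(Y_i) - \log KT_e^n(Y_{1:n} \mid x_{1:n}) \Big] \leq \frac{k}{2} \log\Big( 1 + \frac{n \tau^2}{k \sigma^2} \Big) + \frac{k}{2\tau^2} \, 5\sigma^2 \log n.$$

Thus, by choosing $\tau^2 = 5\sigma^2 k \log(n)$, we get that for any $\varepsilon > 0$,

$$\sup_{\theta_e \in \Theta_e} \max_{x_{1:n}} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i}}(Y_i) - \log KT_e^n(Y_{1:n} \mid x_{1:n}) \Big] \leq \frac{k + \varepsilon}{2} \log n$$

$P_{\theta^*}$-eventually almost surely, and (6) holds for any $b > \frac{k}{2}$. ■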

D. Poisson emissions 

Now the conditional distribution of $Y$ given $X = x$ is Poisson with mean $m_x$ and $\Theta_e = \{ \theta_e = (m_1, \dots, m_k) \mid \forall j \in \mathbb{X}, \ m_j > 0 \}$.

Proposition 3. Assume (A1-2). If one chooses $\alpha > k + 2$ in the penalty (5), then $\hat\tau_n \sim \tau^*$, $P_{\theta^*}$-eventually a.s.

Proof: The identifiability of the Poisson model (A3) has been proved by Teicher in [17], and it is easy to see that Assumptions (A4) to (A6) hold. The prior $\pi_e^n$ on $\Theta_e$ is now defined such that $m_1, \dots, m_k$ are independent identically distributed with distribution Gamma$(t, 1/2)$. Then, using [10]:



logll^ * - logKTC(y 1:n |zi :n ) L < -log - + kt ° gn + -(1 + tbgt) 
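$P_{\theta^*}$-eventually a.s. Then, for any fixed $t > 0$, for any $\varepsilon > 0$,

$$\sup_{\theta_e = (m_1, \dots, m_k) \in \Theta_e} \max_{x_{1:n} \in \mathbb{X}^n} \Big\{ \log g_{\theta_e}(Y_{1:n} \mid x_{1:n}) - \log KT_e^n(Y_{1:n} \mid x_{1:n}) \Big\} \leq \Big( \frac{k}{2} + \varepsilon \Big) \log n$$

$P_{\theta^*}$-eventually almost surely, and (6) holds for any $b > \frac{k}{2}$.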



IV. Gaussian emissions with unknown variance 



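We consider the situation where the emission distributions are Gaussian with the same, but unknown, variance $\sigma^2$ and with a mean $m_x$ depending on the hidden state $x$. Let $\eta = -\frac{1}{2\sigma^2}$ and $\theta_{e,j} = \frac{m_j}{\sigma^2}$ for all $j \in \mathbb{X} = \{1, \dots, k\}$. Here

$$\Theta_e = \big\{ \big( (\theta_{e,j})_{j=1,\dots,k}, \eta \big) : \theta_{e,j} \in \mathbb{R}, \ \eta < 0 \big\}.$$

If $x_{1:n} \in \mathbb{X}^n$, for any $j \in \mathbb{X}$, we set $I_j = \{ i \mid x_i = j \}$ and $n_j = |I_j|$. For the sake of simplicity we omit $x_{1:n}$ in the notation, though $I_j$ and $n_j$ depend on $x_{1:n}$. The conditional likelihood is given, for any $x_{1:n}$ in $\mathbb{X}^n$ and any $y_{1:n}$ in $\mathbb{Y}^n$, by

$$\prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(y_i) = \frac{1}{(2\pi)^{n/2}} \prod_{j=1}^{k} \exp\Big( \eta \sum_{i \in I_j} y_i^2 + \theta_{e,j} \sum_{i \in I_j} y_i - n_j A(\eta, \theta_{e,j}) \Big)$$

where

$$A(\eta, \theta_{e,j}) = -\frac{\theta_{e,j}^2}{4\eta} - \frac{1}{2} \log(-2\eta).$$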



Theorem 2. Assume (A1-2). If one chooses $\alpha > k + 3$ in the penalty (5), then $\hat\tau_n \sim \tau^*$, $P_{\theta^*}$-eventually a.s.

Proof: We shall prove that Theorem 1 applies. First, it is easy to see that Assumptions (A4) to (A6) hold, and the proof of (A3) can be found in [16].
Define now the conjugate exponential prior on $\Theta_e$:

$$\pi_e^n(d\theta_e) = \exp\Big[ \alpha_1^n \eta + \sum_{j=1}^{k} \alpha_{2,j}^n \theta_{e,j} - \sum_{j=1}^{k} \beta_j^n A(\eta, \theta_{e,j}) - B(\alpha_1^n, \alpha_{2,1}^n, \dots, \alpha_{2,k}^n, \beta_1^n, \dots, \beta_k^n) \Big] \, d\eta \, d\theta_{e,1} \cdots d\theta_{e,k}$$

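where the parameters $\alpha_1^n$, $(\alpha_{2,j}^n)_{j=1,\dots,k}$ and $(\beta_j^n)_{j=1,\dots,k}$ will be chosen later, and the normalizing constant may be computed, by integrating first over each $\theta_{e,j}$ (a Gaussian integral) and then over $\eta$ (a Gamma integral), as

$$B(\alpha_1, \alpha_{2,1}, \dots, \alpha_{2,k}, \beta_1, \dots, \beta_k) = \frac{k}{2} \log(4\pi) - \frac{1}{2} \sum_{j=1}^{k} \log \beta_j + \frac{\sum_{j=1}^{k} \beta_j}{2} \log 2 + \log \Gamma\Big( \frac{\sum_{j=1}^{k} \beta_j + k + 2}{2} \Big) - \frac{\sum_{j=1}^{k} \beta_j + k + 2}{2} \log\Big( \alpha_1 - \sum_{j=1}^{k} \frac{\alpha_{2,j}^2}{\beta_j} \Big)$$

where we recall the Gamma function: $\Gamma(z) = \int_0^{+\infty} u^{z-1} e^{-u} \, du$ for any complex number $z$ with positive real part. Theorem 2 now follows from Theorem 1 and the proposition below. ■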



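Proposition 4. If (A1) holds, it is possible to choose the parameters $\alpha_1^n$, $(\alpha_{2,j}^n)_{j=1,\dots,k}$ and $(\beta_j^n)_{j=1,\dots,k}$ such that for any $\varepsilon > 0$,

$$\max_{x_{1:n}} \sup_{\theta_e \in \Theta_e} \Big[ \log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(Y_i) - \log KT_e^n(Y_{1:n} \mid x_{1:n}) \Big] \leq \frac{k + 1 + \varepsilon}{2} \log n$$

$P_{\theta^*}$-eventually a.s.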



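Proof: For any $x_{1:n} \in \mathbb{X}^n$, the parameters $(\eta, (\theta_{e,j})_{j \in \mathbb{X}})$ maximizing the conditional likelihood are given by

$$\eta = -\frac{1}{2\sigma^2_{x_{1:n}}}, \qquad \theta_{e,j} = \frac{m_{x_{1:n},j}}{\sigma^2_{x_{1:n}}}$$

with

$$m_{x_{1:n},j} = \frac{\sum_{i \in I_j} Y_i}{n_j}, \qquad \sigma^2_{x_{1:n}} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i \in I_j} (Y_i - m_{x_{1:n},j})^2$$

so that

$$\log \prod_{i=1}^{n} g_{\theta_{e,x_i},\eta}(Y_i) \leq -\frac{n}{2} \log \sigma^2_{x_{1:n}} - \frac{n}{2} \log 2\pi - \frac{n}{2}.$$

Also,

$$KT_e^n(Y_{1:n} \mid x_{1:n}) = \Big( \frac{1}{\sqrt{2\pi}} \Big)^n \cdot \exp\Big[ B\Big( \alpha_1^n + \sum_{i=1}^{n} Y_i^2, \ \Big( \alpha_{2,j}^n + \sum_{i \in I_j} Y_i \Big)_{1 \leq j \leq k}, \ (\beta_j^n + n_j)_{1 \leq j \leq k} \Big) - B\big( \alpha_1^n, (\alpha_{2,j}^n)_{1 \leq j \leq k}, (\beta_j^n)_{1 \leq j \leq k} \big) \Big].$$

Recall that for all $z > 0$ (see for instance [18])

$$\sqrt{2\pi} \, e^{-z} z^{z - \frac{1}{2}} \leq \Gamma(z) \leq \sqrt{2\pi} \, e^{-z + \frac{1}{12z}} z^{z - \frac{1}{2}}$$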



DRAFT 



September 15, 2011 






so that one gets that, for any $x_{1:n} \in \mathbb{X}^n$ and any $\theta_e \in \Theta_e$



logll^e,^,, (Yi) ~ logKTJdft^lari^) < o(logn) - - log3* l!B - - (1 + log 2) + - log Z ? =1 J 



i=l 



log 



Ej=i z 3 " + fc + 2 



V i=i j=i 



n 3 - + /E 



Choose now 



Then one easily gets that for any $x_{1:n} \in \mathbb{X}^n$ and any $\theta_e \in \Theta_e$

log]J 3e e , x -,r, (ii) - logKT"(Yi : „|xi : „) < o(logn) 



E}=i Pj + k + 2 



log 1 



fc+i 

3=1 L 



rij + 1/n 



TYl x j 

n.rij + 1 n z rij + n 



HI, fe/n + fc + 2, _ ? 
lo S n + o lo S CT x 1: „ 



(7) 



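Let now $|Y|_{(n)} = \max_{1 \leq i \leq n} |Y_i|$. Then for any $x_{1:n} \in \mathbb{X}^n$,

$$\sigma^2_{x_{1:n}} \leq |Y|^2_{(n)} \quad \text{and} \quad |m_{x_{1:n},j}| \leq |Y|_{(n)}, \quad j = 1, \dots, k.$$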



Also, for any partition $(I_1, \dots, I_k)$ of $\mathbb{R}$ in $k$ intervals, define:

$$\sigma^2_{n, I_1, \dots, I_k} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n} \Big( Y_i - \frac{\sum_{i'=1}^{n} Y_{i'} 1_{Y_{i'} \in I_j}}{\sum_{i'=1}^{n} 1_{Y_{i'} \in I_j}} \Big)^2 1_{Y_i \in I_j}$$

and

$$\sigma^2_{I_1, \dots, I_k} = \sum_{j=1}^{k} P_{\theta^*}(Y_1 \in I_j) \, \mathrm{Var}_{\theta^*}(Y_1 \mid Y_1 \in I_j)$$

where $\mathrm{Var}_{\theta^*}(Y_1 \mid Y_1 \in I_j)$ is the conditional variance of $Y_1$ given that $Y_1 \in I_j$. The k-means algorithm, see [19], [20], finds a local minimum of the function $x_{1:n} \mapsto \sigma^2_{x_{1:n}}$ starting from any initial configuration $x_{1:n}$. Each step of the algorithm produces an assignment of the values $Y_{1:n}$ into $k$ clusters (by partitioning the observations according to the Voronoi diagram generated by the means of each cluster). Here, the values $Y_{1:n}$ being real numbers, a Voronoi diagram clustering on $\mathbb{R}$ is nothing else than a clustering by intervals. Because the k-means algorithm converges, in a finite time, to a local minimum of the quantity $x_{1:n} \mapsto \sigma^2_{x_{1:n}}$, if the initial configuration is the $x_{1:n}$ that minimizes $\sigma^2_{x_{1:n}}$, the k-means algorithm will lead to the same configuration $x_{1:n}$. Thus, the minimum of $\sigma^2_{x_{1:n}}$ is a clustering by intervals, that is

$$\inf_{x_{1:n}} \sigma^2_{x_{1:n}} = \inf_{I_1, \dots, I_k} \sigma^2_{n, I_1, \dots, I_k}$$

where the infimum is over all partitions of $\mathbb{R}$ in $k$ intervals.
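The fact that a k-means step on the real line always produces a clustering by intervals can be observed on a small example. The following Python sketch implements plain Lloyd iterations in one dimension together with the within-cluster variance $\sigma^2_{x_{1:n}}$. It is an illustration only: the function names, the handling of empty clusters (the old center is kept) and the stopping rule are assumptions, not taken from the paper.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm on the real line; returns (labels, centers)."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center (Voronoi cells).
        new_labels = [
            min(range(len(centers)), key=lambda j: (v - centers[j]) ** 2)
            for v in values
        ]
        if new_labels == labels:
            break  # the configuration is stable: a local minimum has been reached
        labels = new_labels
        # Update step: each center becomes the mean of its cluster.
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
    return labels, centers


def within_variance(values, labels, k):
    """sigma^2_{x_{1:n}}: mean squared distance of each value to its cluster mean."""
    total = 0.0
    for j in range(k):
        members = [v for v, lab in zip(values, labels) if lab == j]
        if members:
            m = sum(members) / len(members)
            total += sum((v - m) ** 2 for v in members)
    return total / len(values)
```

Sorting the values and reading the labels in that order shows the interval structure: points with the same label are contiguous on the line.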
We now get: 

n 

9e s , x .,v (Yd - logKT"(Yi : „|a;i:„) < o(logn) 

i=l 



n + £ Pj + k + 2 



log 1 



fe+1 



log n 



k/n + k + 2 



\ inf 



log\Y\f n) 



k + 1 



3=1 



Y\ 



(n) 



rij + 1/n J n.rij + 1 



\Y\(n) 



and Proposition 4 follows from the choice (7) and the lemmas below, whose proofs are given in the Appendix. ■
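Lemma 1. If (A1) holds, $\sup_{I_1, \dots, I_k} \big| \sigma^2_{n, I_1, \dots, I_k} - \sigma^2_{I_1, \dots, I_k} \big|$ converges to 0 as $n$ tends to infinity $P_{\theta^*}$-a.s. (Here the supremum is over all partitions of $\mathbb{R}$ in $k$ intervals.) Also, the infimum $s_{\inf}$ of $\sigma^2_{I_1, \dots, I_k}$ over all partitions of $\mathbb{R}$ in $k$ intervals satisfies $s_{\inf} > 0$.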

Lemma 2. If (A1) holds, $P_{\theta^*}$-eventually a.s., $|Y|^2_{(n)} \leq 5 \sigma_*^2 \log n$.

V. Algorithm and simulations 

In this section we first present our practical algorithm. We then apply it in the case of Gaussian emissions with unknown common variance and compare our estimator with the BIC estimator, that is, the estimator obtained when we choose in (4) the BIC penalty $\mathrm{pen}(n, \tau) = \frac{k-1}{2} |\tau| \log n$.

A. Algorithm 

We start this section with the definition of the terms used below:

• A maximal node of a complete tree $\tau$ is a string $u$ such that, for any $x$ in $\mathbb{X}$, $ux$ belongs to $\tau$. We denote by $N(\tau)$ the set of maximal nodes in the tree $\tau$.

• The score of a complete tree $\tau$ on the basis of the observation $(Y_1, \dots, Y_n)$ is the penalized maximum likelihood associated with $\tau$:

$$sc(\tau) = -\sup_{\theta \in \Theta_\tau} \log g_\theta(Y_{1:n}) + \mathrm{pen}(n, \tau) \tag{8}$$

We also require that the emission model belongs to an exponential family such that:

(i) There exist $D \in \mathbb{N}^*$, a function $s : \mathbb{X} \times \mathbb{Y} \to \mathbb{R}^D$ of sufficient statistics and functions $h : \mathbb{X} \times \mathbb{Y} \to \mathbb{R}$, $\psi : \Theta_e \to \mathbb{R}^D$, and $A : \Theta_e \to \mathbb{R}$, such that the emission density can be written as:

$$g_{\theta_{e,x},\eta}(y) = h(x, y) \exp\big[ \langle \psi(\theta_e), s(x, y) \rangle - A(\theta_e) \big]$$

where $\langle \cdot, \cdot \rangle$ denotes the scalar product in $\mathbb{R}^D$.

(ii) For all $S \in \mathbb{R}^D$, the equation:

$$\nabla_{\theta_e} \psi(\theta_e) S - \nabla_{\theta_e} A(\theta_e) = 0$$

where $\nabla_{\theta_e}$ denotes the gradient, has a unique solution denoted by $\bar\theta_e(S)$.

Assumption (ii) states that the function $\bar\theta_e : S \in \mathbb{R}^D \mapsto \bar\theta_e(S) \in \Theta_e$, which returns the complete data maximum likelihood estimator corresponding to any feasible value of the sufficient statistics, is available in closed form.

The key idea of our algorithm is a "bottom to the top" pruning technique. Starting from the maximal complete tree of depth $M = \lfloor \log n \rfloor$, denoted by $\tau_M$, we change each maximal node into a leaf whenever the resulting tree decreases the score.
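The pruning rule just described can be sketched in a few lines of Python. This is only a schematic illustration, not the paper's Algorithm 2: the score is passed as a user-supplied function (in the paper it is the penalized maximum likelihood (8), which requires the EM machinery), trees are sets of tuples with letters in chronological order, the children of a node are assumed to be obtained by extending the context one step further into the past, and all names are hypothetical.

```python
from itertools import product


def full_tree(alphabet, depth):
    """The maximal complete tree of a given depth: all strings of that length."""
    return set(product(alphabet, repeat=depth))


def children(u, alphabet):
    """Children of node u (assumption: one more letter further into the past)."""
    return [(x,) + u for x in alphabet]


def maximal_nodes(tree, alphabet):
    """Nodes all of whose children are leaves of the tree."""
    parents = {s[1:] for s in tree if len(s) > 0}
    return [u for u in sorted(parents) if all(c in tree for c in children(u, alphabet))]


def prune(tree, score, alphabet):
    """Turn maximal nodes into leaves as long as this decreases the score."""
    tree = set(tree)
    best = score(tree)
    improved = True
    while improved:
        improved = False
        for u in maximal_nodes(tree, alphabet):
            candidate = (tree - set(children(u, alphabet))) | {u}
            candidate_score = score(candidate)
            if candidate_score < best:
                tree, best, improved = candidate, candidate_score, True
                break  # recompute the maximal nodes of the new tree
    return tree, best
```

A call such as `prune(full_tree((0, 1), 2), my_score, (0, 1))`, where `my_score` is any function mapping a tree to a real number, returns the pruned tree and its score. Each accepted step removes $k - 1$ leaves, so the loop terminates after finitely many steps.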

We then need to compute the maximum likelihood of any complete tree that is a subtree of $\tau_M$. We start the algorithm by running several iterations of the EM algorithm. During this preliminary step we build estimators of sufficient statistics. These statistics will be used later in the computation of the maximum likelihood estimator $\hat\theta_\tau \in \Theta_\tau$ which realizes the supremum in (8) for any complete context tree $\tau$ that is a subtree of $\tau_M$.

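For any $n \ge 0$, we denote by $W_n$ the vectorial random sequence $W_n = (X_{n-M+1}, \dots, X_n)$. For $n$ big enough, $M \ge d(\tau^*)$ and $(W_n)_n$ is a Markov chain. The intermediate quantity (see [17]) needed in the EM algorithm for the HMM $(W_n, Y_n)$ can be written as follows: for any $(\theta, \theta')$ in $\Theta_{\tau_M}$,

$$\begin{aligned}
Q_{\theta,\theta'} &= E_{\theta'}\left(\log g_\theta(W_{1:n}, Y_{1:n}) \mid Y_{1:n}\right) \\
&= E_{\theta'}\left(\log \nu(W_1) \mid Y_{1:n}\right) + \sum_{i=1}^{n-1} E_{\theta'}\left(\log P_{\theta_t}(W_i, W_{i+1}) \mid Y_{1:n}\right) \\
&\quad + \sum_{i=1}^{n} E_{\theta'}\left(\log g_{\theta_e, W_{i,M}}(Y_i) \mid Y_{1:n}\right),
\end{aligned}$$

where $W_{i,M} = X_i$ is the last coordinate of $W_i$.

Notice that, for any $\theta \in \Theta_{\tau_M}$, if $(w, w') \in (\mathcal{X}^M)^2$ are such that $w_{2:M} \neq w'_{1:M-1}$, then $P_{\theta_t}(w, w') = 0$.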
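The zero pattern of the transition matrix of $(W_n)$ can be checked on a small case. The sketch below is only an illustration: it enumerates the states of $\mathcal{X}^M$ as tuples and lists, for each $w$, the successors $w'$ that satisfy $w_{2:M} = w'_{1:M-1}$. The function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

State = Tuple[int, ...]


def compatible(w: State, w_next: State) -> bool:
    """True when w_next can follow w, i.e. w[2:M] equals w_next[1:M-1]."""
    return w[1:] == w_next[:-1]


def allowed_transitions(k: int, M: int) -> Dict[State, List[State]]:
    """Map every state of X^M to the successors with possibly non-zero mass."""
    states = list(product(range(k), repeat=M))
    return {w: [v for v in states if compatible(w, v)] for w in states}


# With k = 2 symbols and M = 3, each of the 8 states has exactly 2 successors,
# so at most 16 of the 64 entries of the transition matrix are non-zero.
table = allowed_transitions(2, 3)
print(table[(0, 1, 1)])
```

In general at most $k^{M+1}$ of the $k^{2M}$ entries can be non-zero, so an implementation can store the transition statistics per pair $(w, x)$ rather than per pair $(w, w')$.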
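For any $w \in \mathcal{X}^M$ and any $w' \in \mathcal{X}^M$, if we denote by

$$\forall i = 1, \dots, n, \quad \Phi^{\theta'}_{i|n}(w) = P_{\theta'}(W_i = w \mid Y_{1:n}),$$

$$\forall i = 1, \dots, n-1, \quad \Phi^{\theta'}_{i:i+1|n}(w, w') = P_{\theta'}(W_i = w, W_{i+1} = w' \mid Y_{1:n}),$$

and

$$S^{\theta'}_{t,n}(w, w') = \frac{1}{n} \sum_{i=1}^{n-1} \Phi^{\theta'}_{i:i+1|n}(w, w'), \qquad S^{\theta'}_{e,n} = \frac{1}{n} \sum_{x \in \mathcal{X}} \sum_{i=1}^{n} \Big( \sum_{w \in \mathcal{X}^M \mid w_M = x} \Phi^{\theta'}_{i|n}(w) \Big) s(x, Y_i),$$

then there exists a function $C$ such that

$$\frac{1}{n} Q_{\theta,\theta'} = \frac{1}{n} C(\theta', Y_{1:n}) + \langle S^{\theta'}_{t,n}, \log P_{\theta_t} \rangle + \langle S^{\theta'}_{e,n}, \psi(\theta_e) \rangle - A(\theta_e). \tag{9}$$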



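If, for some complete tree $\tau$, we restrict $\theta_t$ to $\Theta_{t,\tau}$, then for any $s$ in $\tau$, for any $w$ in $\mathcal{X}^M$ such that $s$ is postfix of $w$, and for any $x$ in $\mathcal{X}$, we have $P_{\theta_t}(w, (w_{2:M}x)) = P_{s,x}(\theta_t)$. Taking into account the constraint $\sum_{x' \in \mathcal{X}} P_{s,x'} = 1$ with a multiplier $\Lambda$, we consider

$$\frac{1}{n} Q_{\theta,\theta'} + \Lambda \Big( \sum_{x' \in \mathcal{X}} P_{s,x'} - 1 \Big).$$

Thus, the vector $P_{s,\cdot}$ maximising this quantity is a solution of the Lagrangian equations

$$\frac{\partial}{\partial P_{s,x}} \Big[ \frac{1}{n} Q_{\theta,\theta'} + \Lambda \Big( \sum_{x' \in \mathcal{X}} P_{s,x'} - 1 \Big) \Big] = 0, \quad \forall x \in \mathcal{X},$$

and, finally, the estimator of $\theta_t \in \Theta_{t,\tau}$ maximising $\theta \mapsto Q_{\theta,\theta'}$ only depends on the sufficient statistic $S^{\theta'}_{t,n}$ and is given by:

$$\bar P_{s,x}(S^{\theta'}_{t,n}) = \frac{\sum_{w \in \mathcal{X}^M \mid s \text{ postfix of } w} S^{\theta'}_{t,n}(w, (w_{2:M}x))}{\sum_{x' \in \mathcal{X}} \sum_{w \in \mathcal{X}^M \mid s \text{ postfix of } w} S^{\theta'}_{t,n}(w, (w_{2:M}x'))}. \tag{10}$$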
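The aggregation in (10) can be sketched as follows. This is a minimal illustration under stated assumptions: the transition statistic is stored as a dictionary keyed by `(w, x)`, standing for the pair $(w, (w_{2:M}x))$, contexts are tuples whose last entry is the most recent state, and the uniform fallback for a context that carries no mass is a choice made here, not something specified in the paper.

```python
from typing import Dict, Iterable, List, Tuple

Context = Tuple[int, ...]


def transition_estimate(
    s_t: Dict[Tuple[Context, int], float], tree: Iterable[Context], k: int
) -> Dict[Context, List[float]]:
    """For each context s, pool the statistic over all w having s as postfix."""
    estimate = {}
    for s in tree:
        pooled = [0.0] * k
        for (w, x), value in s_t.items():
            if w[len(w) - len(s):] == s:  # s is a postfix of w
                pooled[x] += value
        total = sum(pooled)
        # Assumption of this sketch: fall back to uniform if s has no mass.
        estimate[s] = [v / total for v in pooled] if total > 0 else [1.0 / k] * k
    return estimate


# Toy statistic for k = 2 and M = 2 (values chosen by hand, summing to 1).
S_T = {
    ((0, 0), 0): 0.10, ((0, 0), 1): 0.10,
    ((1, 0), 0): 0.20, ((1, 0), 1): 0.10,
    ((0, 1), 0): 0.05, ((0, 1), 1): 0.15,
    ((1, 1), 0): 0.20, ((1, 1), 1): 0.10,
}
TREE = [(0,), (0, 1), (1, 1)]
P_BAR = transition_estimate(S_T, TREE, 2)
print(P_BAR)
```

Because the same statistic serves every subtree of $\tau_M$, the estimate for a pruned tree is obtained by pooling counts, without rerunning EM.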



While Algorithm 1 computes the sufficient statistics $S_t$ and $S_e$ on the basis of the observations $(Y_k)_{k \in \{1, \dots, n\}}$, Algorithm 2 is our pruning algorithm. It begins with the estimation of the sufficient statistics by calling Algorithm 1. As Algorithm 1 is prone to convergence towards local maxima, we set the initial parameter value $\theta_0$ after running a preliminary k-means algorithm (see [19], [20]): we assign the values $Y_{1:n}$ to $k$ clusters, which produces a sequence of "clusters" $X_{1:n}$. A first estimation of the emission parameters is then possible using this clustering. The initial transition parameter $\theta_{0,t} = (P^0_{w,x})_{w \in \mathcal{X}^M, x \in \mathcal{X}}$ is also computed on the basis of the sequence $X_{1:n}$ using the relation:

$$\forall w \in \mathcal{X}^M, \ \forall x \in \mathcal{X}, \quad P^0_{w,x} = \frac{\sum_{i=1}^{n-M} 1_{X_{i:i+M-1} = w,\, X_{i+M} = x}}{\sum_{i=1}^{n-M} 1_{X_{i:i+M-1} = w}}.$$

Algorithm 1 Preliminary computation of the sufficient statistics

Require: $\theta_0 = (\theta_{t,0}, \theta_{e,0}) \in \Theta_{\tau_M}$, an initial value for the parameter $\theta$.
Require: $t_{EM}$, a threshold.
    stop = 0
    i = 0
    while (stop = 0) do
        i = i + 1
        E step: compute the quantities $S^{\theta_{i-1}}_{t,n}$ and $S^{\theta_{i-1}}_{e,n}$
        M step: set $\theta_i = \big( (\bar P_{w,x}(S^{\theta_{i-1}}_{t,n}))_{w \in \mathcal{X}^M, x \in \mathcal{X}}, \bar\theta_e(S^{\theta_{i-1}}_{e,n}) \big)$
        if ($\|\theta_i - \theta_{i-1}\| < t_{EM}$) then
            stop = 1
        end if
    end while
    E step: compute the quantities $S^{\theta_i}_{t,n}$ and $S^{\theta_i}_{e,n}$
    $S_t = S^{\theta_i}_{t,n}$ and $S_e = S^{\theta_i}_{e,n}$
    return $(S_t, S_e)$
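The initial transition estimate $P^0_{w,x}$, that is, the empirical frequency with which the cluster label $x$ follows the block $w$ of $M$ consecutive labels, can be sketched as below. The function name and the choice to omit blocks $w$ that never occur (for which the ratio is undefined) are assumptions of this example.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Block = Tuple[int, ...]


def initial_transitions(
    labels: Sequence[int], k: int, M: int
) -> Dict[Block, List[float]]:
    """Empirical frequencies of x following the block w in the label sequence."""
    pair_counts: Counter = Counter()
    block_counts: Counter = Counter()
    # 0-based i runs over the n - M positions where both w and x are observed.
    for i in range(len(labels) - M):
        w = tuple(labels[i:i + M])
        pair_counts[(w, labels[i + M])] += 1
        block_counts[w] += 1
    return {
        w: [pair_counts[(w, x)] / count for x in range(k)]
        for w, count in block_counts.items()
    }


# Toy label sequence, as a k-means step with k = 2 might produce.
labels = [0, 1, 0, 1, 1, 0, 1, 0]
P0 = initial_transitions(labels, k=2, M=1)
print(P0)
```

With $M = \lfloor \log n \rfloor$ many blocks may be unseen for small $n$; how to initialise those rows is left open here.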

Then, starting with the initialisation $\tau = \tau_M$, we consider, one after the other, the maximal nodes $u$ of $\tau$. We build a new tree $\tau_{test}$ by taking out of $\tau$ all the contexts $s$ having $u$ as postfix and adding $u$ as a new context: $\tau_{test} = \tau \setminus \{ux \mid ux \in \tau, x \in \mathcal{X}\} \cup \{u\}$. Let $\hat\theta_{test} = \big( (\bar P_{s,x}(S_t))_{s \in \tau_{test}, x \in \mathcal{X}}, \bar\theta_e(S_e) \big)$, which, hopefully, becomes an acceptable proxy for $\arg\max_{\theta \in \Theta_{\tau_{test}}} \log g_\theta(Y_{1:n})$. Let $-\log g_{\hat\theta_{test}}(Y_{1:n}) + pen(n, \tau_{test})$ be an approximation of the score of the context tree $\tau_{test}$, still denoted by $sc(\tau_{test})$. Then, if $sc(\tau_{test}) < sc(\tau)$, we set $\tau = \tau_{test}$. In Algorithm 2, the role of $\tau_2$ is to ensure that all the branches of $\tau$ are tested before shortening again a branch already tested.
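The pruning loop described above can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: a tree is stored as the set of its leaves (contexts), contexts are tuples whose last entry is the most recent state so that the children of $u$ are obtained by prepending a symbol, and `score` is a hypothetical callable standing for the approximate score $sc(\cdot)$.

```python
from itertools import product
from typing import Callable, FrozenSet, Set, Tuple

Context = Tuple[int, ...]
Tree = FrozenSet[Context]


def full_tree(k: int, depth: int) -> Tree:
    """Complete tree whose leaves are all strings of the given depth."""
    return frozenset(product(range(k), repeat=depth))


def maximal_nodes(tree: Tree, k: int) -> Set[Context]:
    """Nodes u all of whose k children (x,) + u are leaves of the tree."""
    parents = {s[1:] for s in tree if s}
    return {u for u in parents if all((x,) + u in tree for x in range(k))}


def prune(tree: Tree, k: int, score: Callable[[Tree], float]) -> Tree:
    """Turn maximal nodes into leaves while doing so lowers the score."""
    tau = tau2 = frozenset(tree)
    change = True
    while change and len(tau) > 1:
        change = False
        for u in sorted(maximal_nodes(tau, k)):
            if u in maximal_nodes(tau2, k):
                children = {(x,) + u for x in range(k)}
                test = frozenset((tau2 - children) | {u})
                if score(test) < score(tau2):
                    tau2 = test
                    change = True
        tau = tau2  # update only after every branch of tau has been tested
    return tau


def is_postfix(t: Context, s: Context) -> bool:
    return len(t) <= len(s) and s[len(s) - len(t):] == t


# Toy score (hypothetical): tree size plus a large cost per target context
# that the candidate tree can no longer distinguish.
TARGET = frozenset({(0,), (0, 1), (1, 1)})


def toy_score(tree: Tree) -> float:
    missing = sum(1 for t in TARGET if not any(is_postfix(t, s) for s in tree))
    return len(tree) + 10 * missing


pruned = prune(full_tree(2, 2), 2, toy_score)
print(sorted(pruned))
```

In the setting of the paper, `score` would evaluate the penalized likelihood at the parameter built from the precomputed statistics, so each test costs one likelihood evaluation rather than a full EM run.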

B. Simulations 

We propose to illustrate the a.s. convergence of $\hat\tau_n$ using Algorithm 2 in the case of Gaussian emissions with unknown variance. We set $k = 2$, and use as minimal complete context tree one of the two complete trees represented in Figure 1 and Figure 2. The true transition probabilities associated with each tree are indicated in boxes under each context.

For each tree $\tau_1^*$ and $\tau_2^*$, we simulate 3 samples of the VLHMM, choosing as true emission parameters $m_0^* = 0$, $\sigma^{2,*} = 1$ and $m_1^*$ varying in $\{2, 3, 4\}$. In the preliminary EM steps, we use as threshold $t_{EM} = 0.001$.
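A sample of this kind can be generated with a short routine. The sketch below is an assumption-laden illustration: the context tree and its transition probabilities in `HYPOTHETICAL_TREE` are made up for the example (they are not the trees of Figures 1 and 2), contexts are tuples whose last entry is the most recent state, and the initial past is drawn uniformly.

```python
import random
from typing import Dict, List, Sequence, Tuple

Context = Tuple[int, ...]

# Hypothetical complete context tree for k = 2: context -> P(next state = 0, 1).
HYPOTHETICAL_TREE: Dict[Context, List[float]] = {
    (0,): [0.7, 0.3],
    (0, 1): [0.2, 0.8],
    (1, 1): [0.5, 0.5],
}


def simulate_vlhmm(
    n: int,
    tree: Dict[Context, List[float]],
    means: Sequence[float],
    sigma: float,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Draw n hidden states from the context tree and Gaussian observations."""
    rng = random.Random(seed)
    k = len(means)
    depth = max(len(s) for s in tree)
    history = [rng.randrange(k) for _ in range(depth)]  # arbitrary initial past
    states: List[int] = []
    observations: List[float] = []
    for _ in range(n):
        past = tuple(history[len(history) - depth:])
        # In a complete tree exactly one context is a postfix of the past.
        context = next(s for s in tree if past[len(past) - len(s):] == s)
        state = rng.choices(range(k), weights=tree[context])[0]
        history.append(state)
        states.append(state)
        observations.append(rng.gauss(means[state], sigma))
    return states, observations


# Emission parameters taken from the text: m0 = 0, variance 1, m1 = 2.
states, observations = simulate_vlhmm(500, HYPOTHETICAL_TREE, [0.0, 2.0], 1.0, seed=1)
print(len(states), len(observations))
```

Only `observations` would be passed to the estimator; `states` is kept to check results against the hidden truth.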

The results of our simulations are summarized in Tables I to IV. The size of the estimated tree $|\hat\tau_n|$ for different
Algorithm 2 Bottom to the top pruning algorithm

Require: $t_{EM}$, a threshold.
    Compute $(S_t, S_e)$ with Algorithm 1 with the $t_{EM}$ threshold.
    Pruning procedure:
    $\tau = \tau_2 = \tau_M$
    change = YES
    while (change = YES AND $|\tau| > 1$) do
        change = NO
        for ($u \in N(\tau)$) do
            if ($u \in N(\tau_2)$) then
                $L_u(\tau_2) = \{s \in \tau_2 \mid u \text{ postfix of } s\}$
                $\tau_{test} = [\tau_2 \setminus L_u(\tau_2)] \cup \{u\}$
                $\hat\theta_{test} = \big( (\bar P_{s,x}(S_t))_{s \in \tau_{test}, x \in \mathcal{X}}, \bar\theta_e(S_e) \big)$
                if ($sc(\tau_{test}) < sc(\tau_2)$) then
                    $\tau_2 = \tau_{test}$
                    $\hat\theta = \hat\theta_{test}$
                    change = YES
                end if
            end if
        end for
        $\tau = \tau_2$
    end while
    return $\tau$




Figure 1: Graphic representation of the complete context tree $\tau_1^*$ with transition probabilities indicated in the box under each leaf $s$: $P^*_{s,0} \mid P^*_{s,1}$.



values of $n$ and $m_1^*$ are reported in Table I when $\tau^* = \tau_1^*$ and in Table III when $\tau^* = \tau_2^*$, for

Figure 2: Graphic representation of the complete context tree $\tau_2^*$ with transition probabilities indicated in the box under each leaf $s$: $P^*_{s,0} \mid P^*_{s,1}$.
Table I: Case $\tau^* = \tau_1^*$. Comparison of $|\hat\tau_n|$ between our estimator and the BIC estimator for different values of $n$ and $m_1^*$.



two choices of penalties: $pen_\alpha(n, \tau) = \sum_{t=1}^{|\tau|} \frac{(k-1)t + \alpha}{2} \log n$ with $\alpha = 5.1$, and the BIC penalty $pen(n, \tau) = \frac{k-1}{2} |\tau| \log n$. The first important remark we make regarding Tables I and III is that, on each simulation and whatever the penalty we used, when $|\hat\tau_n| = |\tau^*|$ we also had $\hat\tau_n = \tau^*$. In the same way, each time $|\hat\tau_n| < |\tau^*|$ (resp. $|\hat\tau_n| > |\tau^*|$), $\hat\tau_n$ was a subtree of $\tau^*$ (resp. $\tau^*$ was a subtree of $\hat\tau_n$). For any combination of $\tau^*$ and $m_1^*$, both estimators seem to converge, except our estimator for one of the two trees when $m_1^* = 2$, where 50 000 observations are not enough to reach convergence. However, for small samples, smaller models are systematically chosen with our estimator, while the BIC estimator reaches the right model for relatively small samples. This behaviour of our estimator shows that our penalty is too heavy.
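To see how much heavier the first penalty is, the two can be compared numerically. The sketch assumes the penalties read $pen_\alpha(n, \tau) = \sum_{t=1}^{|\tau|} \frac{(k-1)t + \alpha}{2} \log n$ and $pen(n, \tau) = \frac{k-1}{2} |\tau| \log n$; the function names are hypothetical.

```python
import math


def pen_alpha(n: int, tree_size: int, k: int, alpha: float = 5.1) -> float:
    """Penalty growing quadratically with the number of contexts."""
    total = sum(((k - 1) * t + alpha) / 2 for t in range(1, tree_size + 1))
    return total * math.log(n)


def pen_bic(n: int, tree_size: int, k: int) -> float:
    """BIC penalty, linear in the number of contexts."""
    return (k - 1) / 2 * tree_size * math.log(n)


# Ratio of the two penalties for k = 2 and a tree with 6 contexts.
n, k, size = 1000, 2, 6
ratio = pen_alpha(n, size, k) / pen_bic(n, size, k)
print(ratio)
```

Under these assumed formulas the ratio does not depend on $n$, since both penalties are multiples of $\log n$; it grows with $|\tau|$, which is consistent with the preference for smaller trees on small samples.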

The score differences $sc(\hat\tau_n) - sc(\tau^*)$ (Table II when $\tau^* = \tau_1^*$ and Table IV when $\tau^* = \tau_2^*$) are the differences
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T * = t*, |r*| =6 
Table II: Case $\tau^* = \tau_1^*$. Score difference $sc(\hat\tau_n) - sc(\tau^*)$.

Table III: Case $\tau^* = \tau_2^*$. Comparison of $|\hat\tau_n|$ between our estimator and the BIC estimator for different values of $n$ and $m_1^*$.

Table IV: Case $\tau^* = \tau_2^*$. Score difference $sc(\hat\tau_n) - sc(\tau^*)$.
between the score of $\hat\tau_n$ computed with the estimated parameter $\hat\theta_n$ and the score of $\tau^*$ computed with the real parameters. This information tells us when the estimators $\hat\tau_n, \hat\theta_n$ are well computed by Algorithm 2. Indeed, when $\hat\tau_n \neq \tau^*$, if the score of $\tau^*$ computed with the real transition and emission parameters is smaller than the score of our estimator with estimated parameters (non negative score difference), then the tree returned by Algorithm 2 is not the expected estimator, that is, the minimizer of the score. In particular, Table II shows that the overestimation of the BIC estimator in the case $m_1^* = 2$ (Table II) can be due to a local minimum problem: Algorithm 2 selected a tree $\tau$ such that $|\tau| > |\tau^*|$ whereas $\tau^*$ had a smaller score. This problem might occur because we use an EM type algorithm, which often leads to local minima. Although we try to take an initial value of the parameters in a neighbourhood of the real ones using the preliminary k-means algorithm, this problem persists. Extra EM loops for each tested tree in Algorithm 2 could also provide a better estimation of the parameters and then improve the score estimation for each tested tree, but they would also increase the complexity of the algorithm.

Finally, we observe that the bigger the quantity $|m_0^* - m_1^*|$ is, the quicker our estimator and the BIC estimator converge. This phenomenon is easily understood: very different emission distributions for different states make it easier to recover the underlying state sequence from the observations, and allow us to build a more precise description of the VLMC behaviour.



VI. Conclusion 



In this paper, we were interested in the statistical analysis of variable length hidden Markov models (VLHMM). We presented such models, then estimated the context tree of the hidden process using penalized maximum likelihood. We showed how to choose the penalty so that the estimator is strongly consistent without any prior upper bound on the depth or on the size of the context tree of the hidden process. We proved that our general consistency theorem applies when the emission distributions are Gaussian with unknown means and the same unknown variance. We proposed a pruning algorithm and applied it to simulated data sets. This illustrates the consistency of our estimator, but also suggests that a smaller penalty could lead to consistent estimation. Finding the minimal penalty ensuring the strong consistency of the estimator with no prior upper bound remains unsolved. A similar problem has been solved by R. van Handel [7] to estimate the order of finite state Markov chains, and by E. Gassiat and R. van Handel [8] to estimate the number of populations in a mixture with i.i.d. observations. The basic idea is that the maximum likelihood behaves as the maximum of approximate chi-square variables, and that the behavior of the maximum likelihood statistic may be investigated using empirical process theory tools to obtain a $\log \log n$ rate of growth. However, it is known for HMM that the maximum likelihood does not behave this way and converges weakly to infinity, see [9]. We bypassed the problem by using information theoretic inequalities, but understanding the pathwise fluctuations of the likelihood in HMM remains a difficult open problem.
Appendix A
Proof of Lemma 1



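For any partition $(I_1, \dots, I_k)$ of $\mathbb{R}$ in $k$ intervals,

$$\sigma^2_{I_1, \dots, I_k} = \sum_{j=1}^{k} P_{\theta^*}(Y_1 \in I_j)\, \mathrm{Var}_{\theta^*}(Y_1 \mid Y_1 \in I_j) \ge \frac{1}{k} \inf_{I :\, P_{\theta^*}(Y_1 \in I) \ge \frac{1}{k}} \mathrm{Var}_{\theta^*}(Y_1 \mid Y_1 \in I)$$

where the infimum is over all intervals $I$ of $\mathbb{R}$; the inequality holds because at least one of the $k$ intervals has probability at least $1/k$. The distribution of $Y_1$ is the Gaussian mixture with density $g^* = \sum_{x \in \mathcal{X}} \pi^*(x) \phi_{m_x^*, \sigma^2}$, where $\pi^*$ is the stationary distribution of $(X_n)_{n \ge 0}$ and $\phi_{m_x^*, \sigma^2}$ is the density of the normal distribution with mean $m_x^*$ and variance $\sigma^2$. The distribution function $F^*$ of $Y_1$ is continuous and increasing, with continuous and increasing inverse quantile function. Thus,

$$\inf_{I_1, \dots, I_k} \sigma^2_{I_1, \dots, I_k} \ge \frac{1}{k} \inf_{-\infty \le a < b \le +\infty :\, F^*(a) + \frac{1}{k} \le F^*(b)} \mathrm{Var}_{\theta^*}(Y_1 \mid Y_1 \in\, ]a, b[).$$

But $\mathrm{Var}_{\theta^*}(Y_1 \mid Y_1 \in\, ]a, b[)$ is a continuous function of $(a, b)$, and the infimum at the right-hand side of the inequality is attained at some $(a, b)$ (possibly infinite) such that $F^*(a) + \frac{1}{k} \le F^*(b)$. Thus $\mathrm{Var}_{\theta^*}(Y_1 \mid Y_1 \in\, ]a, b[) > 0$, and

$$\inf_{I_1, \dots, I_k} \sigma^2_{I_1, \dots, I_k} > 0.$$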

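For any partition $(I_1, \dots, I_k)$ of $\mathbb{R}$ in $k$ intervals,

$$\hat\sigma^2_{I_1, \dots, I_k}(Y_{1:n}) = \frac{1}{n} \sum_{j=1}^{k} \left( \sum_{i=1}^{n} Y_i^2 1_{I_j}(Y_i) - \frac{\left( \sum_{i=1}^{n} Y_i 1_{I_j}(Y_i) \right)^2}{\sum_{i=1}^{n} 1_{I_j}(Y_i)} \right)$$

so that

$$\sup_{I_1, \dots, I_k} \left| \hat\sigma^2_{I_1, \dots, I_k}(Y_{1:n}) - \sigma^2_{I_1, \dots, I_k} \right| \le \left| \frac{1}{n} \sum_{i=1}^{n} Y_i^2 - E(Y_1^2) \right| + k \sup_{I \text{ interval of } \mathbb{R}} \left| \frac{\left( \frac{1}{n} \sum_{i=1}^{n} Y_i 1_I(Y_i) \right)^2}{\frac{1}{n} \sum_{i=1}^{n} 1_I(Y_i)} - \frac{E(Y_1 1_I(Y_1))^2}{E(1_I(Y_1))} \right|.$$

Using [15], $(Y_n)_{n \ge 0}$ is a stationary ergodic process, so that $\frac{1}{n} \sum_{i=1}^{n} Y_i^2 - E(Y_1^2)$ tends to 0 $P_{\theta^*}$-a.s. Let $\epsilon > 0$. We now consider separately the intervals $I$ such that $E(1_I(Y_1)) < \epsilon$ or $E(1_I(Y_1)) \ge \epsilon$.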
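• Let $I$ be such that $E(1_I(Y_1)) < \epsilon$. Using the Cauchy-Schwarz inequality,

$$\left( \frac{1}{n} \sum_{i=1}^{n} Y_i 1_I(Y_i) \right)^2 \le \left( \frac{1}{n} \sum_{i=1}^{n} Y_i^2 1_I(Y_i) \right) \times \left( \frac{1}{n} \sum_{i=1}^{n} 1_I(Y_i) \right),$$

$$E(Y_1 1_I(Y_1))^2 \le E(Y_1^2 1_I(Y_1))\, E(1_I(Y_1))$$

and,

$$E(Y_1^2 1_I(Y_1)) \le \sqrt{E(Y_1^4)} \sqrt{E(1_I(Y_1))} \le M \sqrt{\epsilon}$$

for some fixed positive constant $M$. Thus,

$$\begin{aligned}
\left| \frac{\left( \frac{1}{n} \sum_{i=1}^{n} Y_i 1_I(Y_i) \right)^2}{\frac{1}{n} \sum_{i=1}^{n} 1_I(Y_i)} - \frac{E(Y_1 1_I(Y_1))^2}{E(1_I(Y_1))} \right|
&\le \left| \frac{1}{n} \sum_{i=1}^{n} Y_i^2 1_I(Y_i) - E(Y_1^2 1_I(Y_1)) \right| + 2 E(Y_1^2 1_I(Y_1)) \\
&\le \left| \frac{1}{n} \sum_{i=1}^{n} Y_i^2 1_I(Y_i) - E(Y_1^2 1_I(Y_1)) \right| + 2 M \sqrt{\epsilon}.
\end{aligned}$$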

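• Let now $I$ be such that $E(1_I(Y_1)) \ge \epsilon$. Writing $a_n = \frac{1}{n} \sum_{i=1}^{n} Y_i 1_I(Y_i)$, $b_n = \frac{1}{n} \sum_{i=1}^{n} 1_I(Y_i)$, $a = E(Y_1 1_I(Y_1))$ and $b = E(1_I(Y_1))$, one has

$$\left| \frac{a_n^2}{b_n} - \frac{a^2}{b} \right| \le \frac{|a_n - a| \, (|a_n| + |a|)}{b_n} + \frac{a^2}{b} \, \frac{|b_n - b|}{b_n} \le \frac{|a_n - a| \, (|a_n| + |a|)}{b_n} + E(Y_1^2) \, \frac{|b_n - b|}{b_n},$$

where the last inequality uses the Cauchy-Schwarz bound $a^2 \le E(Y_1^2)\, b$. Since $b \ge \epsilon$, the denominators $b_n$ stay bounded away from 0 as soon as $\sup_I |b_n - b|$ is small, so that the right-hand side is controlled, uniformly in $I$, by the deviations $|a_n - a|$ and $|b_n - b|$.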
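Now, using Lemma 3 below, one gets that, for all positive $\epsilon$,

$$\limsup_{n \to \infty} \sup_{I \text{ interval of } \mathbb{R}} \left| \frac{\left( \frac{1}{n} \sum_{i=1}^{n} Y_i 1_I(Y_i) \right)^2}{\frac{1}{n} \sum_{i=1}^{n} 1_I(Y_i)} - \frac{E(Y_1 1_I(Y_1))^2}{E(1_I(Y_1))} \right| \le 2 M \sqrt{\epsilon}$$

$P_{\theta^*}$-a.s., so that

$$\lim_{n \to \infty} \sup_{I \text{ interval of } \mathbb{R}} \left| \frac{\left( \frac{1}{n} \sum_{i=1}^{n} Y_i 1_I(Y_i) \right)^2}{\frac{1}{n} \sum_{i=1}^{n} 1_I(Y_i)} - \frac{E(Y_1 1_I(Y_1))^2}{E(1_I(Y_1))} \right| = 0$$

$P_{\theta^*}$-a.s. and the Lemma follows.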



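Lemma 3. $\sup_I \left| \frac{1}{n} \sum_{i=1}^{n} Y_i^2 1_I(Y_i) - E(Y_1^2 1_I(Y_1)) \right|$, $\sup_I \left| \frac{1}{n} \sum_{i=1}^{n} Y_i 1_I(Y_i) - E(Y_1 1_I(Y_1)) \right|$ and $\sup_I \left| \frac{1}{n} \sum_{i=1}^{n} 1_I(Y_i) - E(1_I(Y_1)) \right|$ (where the supremum is over all intervals $I$ in $\mathbb{R}$) tend to 0 as $n$ tends to infinity, $P_{\theta^*}$-a.s.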

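Proof: Let us note $\mathcal{F}_a = \{x \mapsto x^a 1_I(x) : I \text{ interval of } \mathbb{R}\}$ for $a = 0, 1, 2$. Since the sequence of random variables $(Y_n)_{n \ge 0}$ is stationary and ergodic, it is enough to prove that, for $a = 0, 1, 2$, for any positive $\epsilon$, there exists a finite set of functions $\tilde{\mathcal{F}}_a$ such that for any $f \in \mathcal{F}_a$, there exist $l, u$ in $\tilde{\mathcal{F}}_a$ such that $l \le f \le u$ and $E(u(Y_1)) - E(l(Y_1)) \le \epsilon$.

For the cases $a = 0$ or $2$ and for any positive $\epsilon$, there exist real numbers $L^1_{a,\epsilon}$ and $L^2_{a,\epsilon}$ such that $\int_{-\infty}^{L^1_{a,\epsilon}} x^a g^*(x)\,dx < \epsilon$ and $\int_{L^2_{a,\epsilon}}^{+\infty} x^a g^*(x)\,dx < \epsilon$, and there exist real numbers $x_{a,1} = L^1_{a,\epsilon} < x_{a,2} < \dots < x_{a,N_{a,\epsilon}-2} < L^2_{a,\epsilon} = x_{a,N_{a,\epsilon}-1}$ such that $\int_{x_{a,i}}^{x_{a,i+1}} x^a g^*(x)\,dx < \epsilon/2$, $i = 1, \dots, N_{a,\epsilon} - 2$. We define:

• for any $i = 1, \dots, N_{a,\epsilon}$, $I^1_{a,i} = [-\infty, x_{a,i}]$,
• and for any $i = 1, \dots, N_{a,\epsilon}$, $I^2_{a,i} = [x_{a,i}, \infty]$,

so that if $\mathcal{I}_a$ is the set $\mathcal{I}_a = \{I^j_{a,i} \mid i = 1, \dots, N_{a,\epsilon},\ j = 1, 2\} \cup \{[x_{a,i_1}, x_{a,i_2}]\}_{i_1 < i_2}$, the set $\tilde{\mathcal{F}}_a = \{x^a 1_I \mid I \in \mathcal{I}_a\}$ verifies the above conditions.

For the case $a = 1$, the construction of the sequence $x_{a,1} = L^1_{a,\epsilon} < x_{a,2} < \dots < x_{a,N_{a,\epsilon}-2} < L^2_{a,\epsilon} = x_{a,N_{a,\epsilon}-1}$ such that $\int_{x_{a,i}}^{x_{a,i+1}} |x| g^*(x)\,dx < \epsilon/2$ is similar except that we introduce in the sequence :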



Appendix B
Proof of Lemma 2

Let $t_n = 5 \sigma^2 \log n$. One has
where $M = \max_{j = 1, \dots, k} m_j^*$ and $U$ is a Gaussian random variable with distribution $\mathcal{N}(0, 1)$. Then, for large enough $n$:

and the result follows from the Borel-Cantelli lemma. (17)
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